fix: making is_nullable ansi aware for sum_decimal and avg_decimal #981

Open
wants to merge 4 commits into main

Conversation

vaibhawvipul
Contributor

Which issue does this PR close?

Closes #961.

Rationale for this change

This PR makes is_nullable ANSI-aware for sum_decimal and avg_decimal.

What changes are included in this PR?

Updated planner.rs, sum_decimal.rs and avg_decimal.rs

How are these changes tested?

CI tests pass

@viirya
Member

viirya commented Sep 30, 2024

Triggered CI.

@parthchandra
Contributor

Where is the error thrown when ansi_mode is enabled and an overflow occurs?

@vaibhawvipul
Contributor Author

vaibhawvipul commented Oct 1, 2024

> Where is the error thrown when ansi_mode is enabled and an overflow occurs?

As per the issue's description: SumDecimal currently hardcodes nullable=true, and this is correct when ANSI mode is not enabled, because overflows produce null values. In ANSI mode, however, overflows raise exceptions. If the input is non-nullable then SumDecimal should probably also be non-nullable. The same is true for AvgDecimal.

I thought the scope of this work was only to make nullable either true or false based on ANSI mode. @parthchandra
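
For illustration, a minimal sketch of the nullability rule being described. The `EvalMode` enum and `SumDecimal` struct here are hypothetical stand-ins for the sketch, not the actual Comet or DataFusion types:

```rust
// Hypothetical stand-ins, not the actual Comet API.
#[derive(Clone, Copy)]
enum EvalMode {
    Legacy, // overflow degrades to null
    Ansi,   // overflow raises an error instead
}

struct SumDecimal {
    input_nullable: bool, // nullability of the child expression
    eval_mode: EvalMode,
}

impl SumDecimal {
    // In legacy mode the result must be nullable because overflow yields null.
    // In ANSI mode overflow raises instead, so the result can inherit the
    // input's nullability.
    fn is_nullable(&self) -> bool {
        match self.eval_mode {
            EvalMode::Legacy => true,
            EvalMode::Ansi => self.input_nullable,
        }
    }
}

fn main() {
    let legacy = SumDecimal { input_nullable: false, eval_mode: EvalMode::Legacy };
    let ansi = SumDecimal { input_nullable: false, eval_mode: EvalMode::Ansi };
    assert!(legacy.is_nullable());
    assert!(!ansi.is_nullable());
}
```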

@andygrove
Member

Thanks for working on this @vaibhawvipul.

I checked Spark's source code in the master branch, and it also hardcodes nullable as true. This makes sense because if we perform a sum on a column that only contains nulls, then we would expect a null output.

It would be good to add a test in CometAggregateSuite to confirm that we have the same behavior as Spark in this case.

Parth raises a good point that we should check that we have the correct behavior for the ANSI overflow case, but that can be a separate issue/PR.
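
To make the all-nulls case concrete, here is a tiny plain-Rust stand-in for SQL SUM's null semantics (illustrative only, not Comet code):

```rust
// SQL SUM skips nulls, but a column containing only nulls sums to NULL,
// so the aggregate's output is nullable regardless of ANSI mode.
fn sum_nullable(values: &[Option<i64>]) -> Option<i64> {
    values
        .iter()
        .filter_map(|v| *v) // nulls are ignored by SUM
        .fold(None, |acc, x| Some(acc.unwrap_or(0) + x))
}

fn main() {
    assert_eq!(sum_nullable(&[None, None]), None); // all nulls => NULL result
    assert_eq!(sum_nullable(&[None, Some(2), Some(3)]), Some(5));
}
```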

@parthchandra
Contributor

> > Where is the error thrown when ansi_mode is enabled and an overflow occurs?
>
> As per the issue's description: SumDecimal currently hardcodes nullable=true, and this is correct when ANSI mode is not enabled, because overflows produce null values. In ANSI mode, however, overflows raise exceptions. If the input is non-nullable then SumDecimal should probably also be non-nullable. The same is true for AvgDecimal.
>
> I thought the scope of this work was only to make nullable either true or false based on ANSI mode. @parthchandra

Oh, I didn't realize the scope of this PR was limited. I wonder if we may end up with incorrect results if we change the nullability for ANSI mode but do not throw an exception on overflow.

If an expression result is nullable, then the corresponding vector has a validity bit, and downstream operations check it before accessing a value. If the expression is not nullable, then a downstream operation can eliminate the nullability check and access the value directly. But in this case the value would be an overflow value, which would be incorrect.
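
To sketch the contract being described here, assume a hypothetical `accumulate` helper (not Comet's actual accumulator): once the output is declared non-nullable under ANSI mode, the overflow path has to raise an error, because there is no validity bit left to hide an incorrect value behind:

```rust
// Hypothetical accumulator step, illustrating the discussion above.
fn accumulate(sum: i128, v: i128, ansi_mode: bool) -> Result<Option<i128>, String> {
    match sum.checked_add(v) {
        Some(s) => Ok(Some(s)),
        // ANSI mode: overflow must surface as an error, never as a value.
        None if ansi_mode => Err("ARITHMETIC_OVERFLOW in decimal SUM".to_string()),
        // Legacy mode: overflow degrades to null, hence nullable=true.
        None => Ok(None),
    }
}

fn main() {
    assert_eq!(accumulate(1, 2, true), Ok(Some(3)));
    assert!(accumulate(i128::MAX, 1, true).is_err()); // ANSI: error
    assert_eq!(accumulate(i128::MAX, 1, false), Ok(None)); // legacy: null
}
```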

> It would be good to add a test in CometAggregateSuite to confirm that we have the same behavior as Spark in this case.

That's a good idea.

@vaibhawvipul
Contributor Author

> Thanks for working on this @vaibhawvipul.
>
> I checked Spark's source code in the master branch, and it also hardcodes nullable as true. This makes sense because if we perform a sum on a column that only contains nulls, then we would expect a null output.
>
> It would be good to add a test in CometAggregateSuite to confirm that we have the same behavior as Spark in this case.
>
> Parth raises a good point that we should check that we have the correct behavior for the ANSI overflow case, but that can be a separate issue/PR.

@andygrove I have added a test in CometAggregateSuite and we can observe that we have behavior parity with Spark. I can work on the issue raised by Parth in a separate PR. Please let me know your thoughts.

(-999.99, 999.99),
(-999.99, 999.99)
""")
checkSparkAnswer("SELECT SUM(col1), AVG(col1), SUM(col2), AVG(col2) FROM test")
Member

We should use checkSparkAnswerAndOperator here to make sure that the Comet query really is running with Comet expressions.

Suggested change:
-checkSparkAnswer("SELECT SUM(col1), AVG(col1), SUM(col2), AVG(col2) FROM test")
+checkSparkAnswerAndOperator("SELECT SUM(col1), AVG(col1), SUM(col2), AVG(col2) FROM test")

Successfully merging this pull request may close issue #961: SumDecimal always returns nullable=true.